Abstract

Enhancing the efficiency of solar evaporation is important for solar stills. In this
study, the weighted values of environmental factors (descriptors) on the efficiency of
solar evaporation are obtained using a machine learning algorithm, random forest.
To assess the advantage of random forest over conventional data analysis, two
traditional methods, pair wise plots and Pearson correlation analysis, are carried out for
comparison. Experimental data are collected from around 100 articles published since
2014. The results indicate that the traditional methods fail to yield reasonable weighted
values, whereas random forest succeeds. Thermal design is found to be the most
significant descriptor for achieving high efficiency. The lack of a complete dataset is the
main obstacle to a more in-depth and comprehensive analysis. This work may promote
studies on solar evaporation and solar stills.
1. Introduction

With population growth and the development of industry and agriculture, the
shortage of fresh water is becoming one of the catastrophic problems that the world
faces. Given that seawater accounts for 97% of the planet's water resources, it is
desirable to develop technologies for seawater desalination. Many effective
desalination methods have been proposed in the past, such as multistage flashing,
reverse osmosis, multi-effect distillation, and vapor compression.

Compared with other methods, the solar still attracts increasing interest owing to its
eco-friendliness, simple construction and maintenance, low installation cost, and long
operating life. Solar evaporation is one of the crucial processes in a solar still.
Accordingly, over the past decades, many approaches have been proposed to achieve
high solar evaporation efficiency, such as using nanofluids, cotton cloth, sponge, and
charcoal.

However, the evaporation efficiency is affected by many factors, such as material
type, thermal design, ambient temperature, solar intensity, and so forth. It is therefore
of interest to determine the importance, or weighting, of the different factors.
Empirically, several important factors can be identified. However, it is difficult to
quantify the importance of each descriptor, so few works have discussed this point in
the field of solar evaporation. On the other hand, quantitative analysis of descriptor
importance is a widespread scientific problem. In chemistry, for example, machine
learning techniques such as random forest have been used to measure the descriptor
importance for the polymer chain angle. It was found that machine learning can
accurately measure the relationship between different factors (descriptors) and their
influence on the target value, which represents a meaningful first step towards the
high-throughput screening of polymer chemistry to identify compositions with
desirable bulk properties.

In the current study, three methods are used and compared for obtaining the
weighted value of each descriptor in solar evaporation: two traditional data science
methods, i.e. pair wise plots (PWP) and Pearson correlation analysis (PCA), and a
machine learning algorithm, random forest (RF). The weighted values are also called
descriptor importance. First, the traditional data science methods are applied: pair
wise plots and Pearson correlation analysis are used to measure the correlations
between pairs of descriptors. Then, random forest is used to measure the importance
of each descriptor for the evaporation efficiency. The resulting descriptor importance
can ultimately help scientists design high-efficiency solar evaporation systems.


2. Methodology 
Fig. 1 Main flowchart of the current study.


The main flowchart of the current study is shown in Fig. 1. The experimental data
used in the analysis are collected from around 100 articles published since 2014
(details in the Supporting Materials).
Table 1 Details of the data representation
Descriptor          Classification      Label    Number of samples
Solar intensity     1 kW                0        51
                    1-10 kW             1        25
                    >10 kW              2        10
Thermal design      3D interface        0        53
                    2D/1D interface     1        21
                    Volumetric          2        12
Surface diameter    <3 cm               0        30
                    3-4 cm              1        29
                    >4 cm               2        29
Absorptivity        <0.95               0        23
                    0.95                1        31
                    >0.95               2        34
Tamb                <24°C               0        21
                    24-25°C             1        40
                    >25°C               2        27
Tvapor              <50°C               0        37
                    50-70°C             1        31
                    >70°C               2        20
Efficiency          <75%                0        28
                    75%-85%             1        31
                    >85%                2        28


For the collected dataset M = {X, y}, y is the objective value (energy efficiency
of evaporation) and X are the input descriptors corresponding to y, such as the
solar intensity, thermal design, surface diameter, absorptivity, Tamb (ambient
temperature), and Tinterface (interface temperature). Owing to the lack of details in the
original articles, some values of surface diameter, absorptivity, Tamb, and Tinterface are
missing. The Mean Completer method is used to fill in the missing data. Each
descriptor is divided into three labels. The detailed distribution of dataset M is listed
in Table 1.
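
The filling and labelling steps described above can be sketched as follows. This is a minimal illustration, assuming that the Mean Completer replaces each missing value with the mean of the reported values of the same descriptor; the function names and the temperature values are hypothetical and are not taken from the collected dataset.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def to_label(value, low, high):
    """Map a value to label 0 (below low), 1 (low to high), or 2 (above high)."""
    if value < low:
        return 0
    if value <= high:
        return 1
    return 2

# Hypothetical ambient temperatures in deg C; None marks an unreported value.
t_amb = [22.0, None, 25.0, 28.0, None, 24.0]
filled = fill_missing_with_mean(t_amb)

# Thresholds follow the Tamb rows of Table 1: <24, 24-25, >25 deg C.
labels = [to_label(t, 24.0, 25.0) for t in filled]
print(filled)
print(labels)
```

One consequence of this kind of filling is that every missing entry of a descriptor receives the same label, which reduces the spread of that descriptor in the dataset.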

Three methods, namely pair wise plots (PWP), Pearson correlation analysis
(PCA), and a machine learning algorithm, i.e. random forest (RF), are studied in this
work. The application of the three methods is summarized as follows:

2.1 Pearson correlation analysis 

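PCA is usually adopted to quantify the correlation between two different
descriptors. For descriptors $X_1$ and $X_2$ in M, such as solar intensity and solar
absorptivity, PCA can be calculated as Eq. 1:

$$ r = \frac{\sum_{i=1}^{n} (X_{1,i} - \bar{X}_1)(X_{2,i} - \bar{X}_2)}{\sqrt{\sum_{i=1}^{n} (X_{1,i} - \bar{X}_1)^2 \, \sum_{i=1}^{n} (X_{2,i} - \bar{X}_2)^2}} \tag{1} $$

where $n$ is the number of samples, $X_{1,i}$ and $X_{2,i}$ are the values of the two
descriptors for sample $i$, and $\bar{X}_1$ and $\bar{X}_2$ are their mean values.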


The values of PCA are dimensionless and range from -1 to 1. The closer the value
is to 1 or -1, the stronger the correlation between the two descriptors. A value of 0
suggests no linear correlation. A positive value indicates a positive correlation, and a
negative value indicates a negative correlation.
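
As a worked illustration of Eq. 1, the coefficient can be computed directly from two lists of descriptor labels. The sketch below uses only the Python standard library; the function name `pearson` and the two label vectors are hypothetical examples, not data from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of the deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical label vectors (0, 1, 2) for two descriptors.
x1 = [0, 1, 2, 0, 1, 2]
x2 = [0, 1, 2, 1, 2, 0]
print(pearson(x1, x1))  # a descriptor against itself
print(pearson(x1, x2))  # a weakly correlated pair
```

The function is undefined when one of the inputs is constant, because the denominator is then zero; a descriptor with a single label across all samples would have to be handled separately.
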
2.2 Pair wise plots 
Pair wise plots draw scatterplots for descriptor correlations and histograms for
univariate distributions in an intuitive way. As presented in Fig. 3, the subfigures on
the diagonal represent the distribution of a particular descriptor. The other subfigures
help us to observe, visually and qualitatively, the relationship between each pair of
descriptors. If the points in such a subfigure form a rising or falling diagonal trend,
the corresponding pair of descriptors is strongly correlated. Otherwise, the two are
regarded as uncorrelated.
2.3 Random forest 

Random forest is a typical ensemble method, which combines multiple decision
trees into one model to improve performance. It is widely applied in many scientific
and engineering fields, such as statistics, materials science, and biology. The main
steps of RF, shown in Fig. 2, are as follows.

Fig. 2 Schematic of applying random forest to study the importance of different
descriptors.


(1) Data preprocessing 

In the present study, data representation is performed for converting data into 
symbols which can be read by computers. For instance, three types of thermal design, 
3D interface, 2D/1D interface, and volumetric, are represented as 0, 1, and 2 (details in 
Table 1). Finally, the dataset is divided into training set TA and test set TE according to 
a certain ratio. 

(2) Model construction 

Based on the processed dataset TA, the bootstrap resampling method is used to
randomly generate K sets of data, and K decision trees are then grown. For example,
in calculating Fig. 5a, each dataset includes three descriptors (thermal design,
absorptivity, and a random descriptor) and the label (efficiency). At each node of a
decision tree, the dataset is split into two parts according to the value of a chosen
descriptor. After all descriptors have been traversed, the final node gives the label
(efficiency) of the data. The prediction of the model is obtained by a vote among the
K decision trees.

(3) Model validation 

The test dataset TE, which is not used in model construction, is used to judge the
accuracy of the model. If the accuracy on TA is much higher than that on TE
(e.g. 0.9 for TA and 0.5 for TE), the model is considered to be overfitting. If the
accuracies on both TE and TA are too low (e.g. 0.5 for TA and 0.5 for TE), the model
is considered to be underfitting. Both overfitting and underfitting are unacceptable, in
which case the model needs to be retrained. If the accuracies on TA and TE are both
high enough and similar to each other, the model is considered to be well trained.
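
The three steps above (bootstrap resampling, tree growing with voting, and validation on TE) can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in this study: the trees split on Gini impurity, the per-split descriptor subsampling of full random forest implementations is omitted, the data are synthetic, and the thresholds in `diagnose` are illustrative assumptions.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (descriptor index, threshold) that lowers Gini most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, depth=0, max_depth=4):
    """Grow a tree as nested tuples (descriptor, threshold, left, right)."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f, t = split
    lo = [i for i, r in enumerate(rows) if r[f] <= t]
    hi = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in lo], [labels[i] for i in lo], depth + 1, max_depth),
            grow_tree([rows[i] for i in hi], [labels[i] for i in hi], depth + 1, max_depth))

def predict_tree(node, row):
    """Follow the splits until a leaf label is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def grow_forest(rows, labels, k, seed=0):
    """Grow k trees, each on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest

def predict(forest, row):
    """Majority vote over the trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

def accuracy(forest, rows, labels):
    """Fraction of samples whose voted label matches the true label."""
    return sum(predict(forest, r) == y for r, y in zip(rows, labels)) / len(labels)

def diagnose(acc_ta, acc_te, min_acc=0.7, max_gap=0.15):
    """Classify the fit; both thresholds are illustrative assumptions."""
    if acc_ta - acc_te > max_gap:
        return "overfitting"
    if acc_ta < min_acc and acc_te < min_acc:
        return "underfitting"
    return "well trained"

# Synthetic data: descriptor 0 fully determines the label, descriptor 1 is noise.
rng = random.Random(1)
X = [[rng.randrange(3), rng.randrange(3)] for _ in range(100)]
y = [row[0] for row in X]
X_ta, y_ta, X_te, y_te = X[:80], y[:80], X[80:], y[80:]  # 80%/20% split

forest = grow_forest(X_ta, y_ta, k=25)
acc_ta, acc_te = accuracy(forest, X_ta, y_ta), accuracy(forest, X_te, y_te)
print(acc_ta, acc_te, diagnose(acc_ta, acc_te))
```

Because the synthetic label depends on a single descriptor, both accuracies are high here; with real, noisier data the gap between the two accuracies is the quantity to watch.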
3. Result and discussion

As a starting point, two traditional data science methods, pair wise plots and
Pearson correlation analysis, are used to find the weighted values, i.e. the
descriptor importance. The results are displayed in Fig. 3 and Fig. 4. In addition to
such mathematical data analysis, well-established machine learning algorithms are
used in materials informatics to extract the relationship between descriptors and the
target property. Accordingly, Fig. 5 shows the results calculated by the machine
learning algorithm, random forest.


3.1 Result of traditional data analysis 
Fig. 3 Pair wise plots of the dataset.


Fig. 3 shows the pair wise plots between the different descriptors (environment
descriptors) and the efficiency of solar evaporation. As can be seen from the top row
of plots in Fig. 3, there is no obvious linear relation between the efficiency and the
other descriptors. Moreover, the dataset is clearly discrete and not evenly distributed:
most of the solar intensity data are around 1 kW, and the thermal design takes
discrete values. Since PWP focuses purely on the mathematical mapping within the
dataset, it is reasonable that PWP cannot measure the descriptor importance from
such a defective dataset.
Fig. 4 Map of Pearson correlation analysis.


The values of PCA are displayed in Fig. 4. For all investigated descriptors, the
absolute PCA values are below 0.3, which means that all descriptors are only weakly
correlated with the efficiency. Therefore, as with PWP, reasonable results for the
descriptor importance are unlikely to be drawn from the PCA values.

3.2 Result of machine learning algorithms 

Fig. 5a-5d show the descriptor importance quantified by RF using 2, 3, 4, and
6 selected descriptors, respectively. The results indicate that thermal design is the
most important descriptor in solar evaporation among all chosen descriptors. The
importance of thermal design is at least 2 times higher than that of the other
descriptors. This result shows that optimizing the heat transfer process in a solar
evaporation system is essential for enhancing the efficiency of solar evaporation. The
result is reasonable because, when the thermal design is poor, only a small part of the
solar energy is used for evaporation. For example, in a traditional solar evaporation
system, some heat goes into heating the bulk water instead of promoting evaporation.
Therefore, the efficiency of a volumetric system is lower than that of an interface
system. Besides, Fig. 5 shows that solar intensity is an unimportant descriptor. This is
because, with optimized thermal and material design, high efficiency can be obtained
at either high or low solar intensity, as reported in many works. Solar intensity is
therefore unimportant and comparable to a random descriptor. Here, the random
descriptor is a set of random data that has no relationship to the energy efficiency and
is used as a benchmark.
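
The role of the random descriptor as a benchmark can be illustrated with a simplified importance score. The sketch below assumes an impurity-based measure: for each descriptor it computes the decrease in Gini impurity obtained by partitioning the samples on that descriptor once, then normalizes the scores so that they sum to 1. This single-level score is a stand-in for the tree-averaged importance of a full random forest, and the data are synthetic, not results of this study.

```python
import random
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(column, labels):
    """Gini decrease when the samples are partitioned by one descriptor's label."""
    groups = defaultdict(list)
    for value, y in zip(column, labels):
        groups[value].append(y)
    weighted = sum(len(g) * gini(g) for g in groups.values()) / len(labels)
    return gini(labels) - weighted

def importance(data, labels):
    """Normalized importance per descriptor; the values sum to 1."""
    raw = {name: impurity_decrease(col, labels) for name, col in data.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

rng = random.Random(0)
n = 90
informative = [rng.randrange(3) for _ in range(n)]
# Synthetic efficiency label: follows the informative descriptor about 80% of the time.
efficiency = [v if rng.random() < 0.8 else rng.randrange(3) for v in informative]
data = {
    "informative descriptor": informative,
    "random descriptor": [rng.randrange(3) for _ in range(n)],  # benchmark
}

imp = importance(data, efficiency)
for name, value in imp.items():
    print(f"{name}: {value:.3f}")
```

A descriptor whose score is close to that of the random descriptor carries no more information about the label than noise does, which is how the benchmark is read in the discussion above.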
Fig. 5 Descriptor importance obtained by RF. The values of descriptor importance
sum to 1. (a)-(d) are the results using 2, 3, 4, and 6 different descriptors,
respectively. Surface diameter is the diameter (length) of the evaporation surface.


Meanwhile, the descriptor importance of solar absorptivity is much lower than
expected, as shown in Fig. 5. Higher absorptivity makes more energy available for
evaporation and should strongly affect the efficiency. The reason may be that almost
all reported works chose materials with very high absorptivity (>90%), so the dataset
cannot reach ergodicity, i.e. it does not cover the full range of absorptivity values.
Hence, its importance is underestimated in the calculation. Besides, the ambient
temperature (Tamb) and the evaporation interface temperature (Tinterface) are
insignificant, which might be due to the small differences in Tamb and Tinterface among
most works. However, temperature is actually a very important descriptor in
natural-convection-based evaporation processes. Therefore, to capture the real
importance of temperature, more work should be done at different ambient and
interface temperatures.

It can be concluded that a more accurate machine learning calculation requires
more complete data. Other fields, such as materials science, have complete databases
of physical properties and theoretical calculation methods. In the current study,
however, the shortage of reported data is a serious problem because authors do not
provide exact values of some descriptors. For example, values of the ambient
temperature, the diameter of the evaporation surface, the absorptivity, and the
temperature of the evaporation interface are missing in some papers. Therefore, to
obtain more accurate machine learning results, authors should provide a complete
dataset of experimental descriptors in their future works. In addition, some other
potentially important descriptors of material design, such as thermal conductivity,
contact angle, specific area, porosity, characteristic size, functional group, and so
forth, are not included in the RF calculation at the current stage, because the detailed
material properties are not given in most papers. It is worth noting that a full dataset
of descriptors in research reports will help to push the field forward.


3.3 Effect of the size of dataset 
Fig. 6 Descriptor importance for three combinations of train/test set.


As mentioned in other studies, the quality of the dataset determines the reliability
of machine learning algorithms. To avoid overfitting and further quantify the effect of
the dataset in the current study, the initial dataset was split into three combinations of
train/test set, i.e. 70%/30%, 80%/20%, and 90%/10%. The results are summarized in
Fig. 6. As can be seen, as the test set shrinks, there is no obvious difference between
the models. Thermal design is the most important descriptor in all cases, and
absorptivity is the second most important. The other descriptors are unimportant and
comparable to a random descriptor. This suggests that the descriptor importance
reflects the physical mechanism rather than the particular split of the dataset, and
indicates that the current dataset is able to yield a convergent solution.
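
The three train/test combinations can be produced with a simple shuffled split, sketched below. The function name `split_dataset`, the fixed seed, and the sample count of 100 are hypothetical choices for illustration; the actual splitting procedure of this study is not specified in the text.

```python
import random

def split_dataset(samples, train_fraction, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder sample indices; the dataset size here is hypothetical.
samples = list(range(100))

for fraction in (0.7, 0.8, 0.9):
    train, test = split_dataset(samples, fraction)
    print(f"train fraction {fraction}: {len(train)} train, {len(test)} test")
```

Using the same seed for all three ratios keeps the shuffled order identical, so the larger training sets contain the smaller ones and differences between the models come only from the added samples.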


4. Conclusion 

In conclusion, the importance of the factors affecting the efficiency of solar
evaporation is analyzed by pair wise plots, Pearson correlation analysis, and random
forest. The experimental data used in the analysis are collected from around 100
articles. The results indicate that pair wise plots and Pearson correlation analysis
cannot measure the descriptor importance from a defective dataset. In contrast,
random forest obtains reasonable results. The random forest results show that thermal
design is the most important descriptor determining the efficiency of solar
evaporation. It can be concluded that machine learning helps to understand the
importance of various descriptors quantitatively, which will help to push the solar
still field forward.

Although machine learning obtained meaningful results, it should be emphasized
that, owing to the limited amount and quality of experimental data in published
articles, the current analysis is more qualitative than quantitative. It is hoped that
authors will provide more detailed data and standardized descriptors in future
publications. This will promote the application of machine learning in studying
solar stills.